Using different views in analysis


What does this look like?


A circle with a dot in the center? 
A sphere with a hole through the center? 
It could be this...
Or it could be this...
A single view can mislead you...
Conclusion


As designers, you have an arsenal 
of tools, techniques, and 
personnel available to you. 


Given your available budget and
time, you must be smart and
efficient in how you work and what you do.
That’s where you can make a
difference.


Questions? 


Why are we here?
WORKSHOP G: Probabilistic risk assessment: The basis for
recognizing emerging operational risks 


* During this session we will discuss how a systematic and
  comprehensive methodology to evaluate risks associated with
  complex engineering and technological systems can help
  companies identify emerging risk to their critical operations.
* Through the use of examples, we will explore how specific tools
  and processes can help approach operational risks with:
  > A quantitative evaluation of system safety
  > Identification, selection, and screening of initiating events
  > Definition and modeling of scenarios, initiating and pivotal events,
    modeling and data development, and risk quantification and uncertainty
    analysis
  > Risk importance ranking and cutset analysis for risk reduction and
    communication
Why are we here?


PRA is one of the tools in our S&MA toolbox. It provides both depth
and breadth in evaluating systems, vehicles, vessels, facilities, and
missions.


It’s been used successfully in several industries, such as commercial 
nuclear power, aerospace, transportation, chemical, and medical. 


NASA continues to get budgets with high expectations from the public. 
S&MA must continue to do its job with less, thus we have to be smarter 
and more efficient. 


Today’s workshop is to help take you to the next level in understanding 
this tool and how to use it. 


✓ When to do a PRA?
✓ How to support/perform it?
✓ How to recognize a good one?
✓ How to use it in your risk-informed decision making process?
The PRA Team


* A PRA system analysis team includes both system domain
  experts and PRA analysts. The key to success is multi-way
  communication between the PRA analysts, domain experts,
  and management.


* A majority of PRA analysts have engineering degrees with
  operations and/or design backgrounds in order to understand
  how systems work and fail. This is essential in developing the
  failure logic of the vehicle or facility.


* Good data analysts understand how to take the available data
  and generate the probabilities, with their associated uncertainty,
  that the modelers need for the basic events.


* Building or developing a PRA involves:
  - understanding its purpose and the appropriate modeling techniques,
  - designing how it will serve that purpose,
  - populating it with the desired failure logic and probabilities, and
  - troubleshooting it (nothing works the first time)
PRA Overview


Questions a PRA can answer for your organization:


✓ “What could go wrong and what are the consequences?”
✓ “What is the likelihood of an undesirable event?”
✓ “Where should I focus resources to reduce overall risk?”
✓ “What are the uncertainties of my processes and systems?”
PRA Overview


NEW DEVELOPMENTS 

The ideal time to conduct a PRA is at the beginning of the design process 
to incorporate the necessary safety and risk avoidance measures 
throughout the development phase at minimal cost. 
EXISTING SYSTEMS

PRA can be applied to existing systems to identify and prioritize risks 
associated with operations. Risk assessments can evaluate the impact of 
system changes and help avoid compromises in quality or reliability while 
increasing productivity. 




INCIDENT RESPONSE 

In the event of unexpected downtime or an accident, our team can assess 
the cause of the failure and develop appropriate mitigation plans to 
minimize the probability of comparable events in the future. 
In a nutshell, PRA can be applied from concept to decommissioning
during the life cycle, including design and operations.

PRA Overview
What is PRA?


PRA is a comprehensive, structured, and disciplined approach to 
identifying and analyzing risk in engineered systems and/or processes. 
It attempts to quantify rare event probabilities of failures. It attempts to 
take into account all possible events or influences that could 
reasonably affect the system or process being studied. It is inherently 
and philosophically a Bayesian methodology. In general, PRA is a 
process that seeks answers to three basic questions: 


✓ What kinds of events or scenarios can occur (i.e., what can go
  wrong)?
✓ What are the likelihoods and associated uncertainties of the events
  or scenarios?
✓ What consequences could result from these events or scenarios
  (e.g., Loss of Crew and Loss of Mission)?


There are other definitions 


The models are developed in “failure space”. This is usually different 
from how designers think (e.g. Success space). 


PRAs are often characterized by (but not limited to) event tree models, fault 
tree models, and simulation models 
PRA Process


[Figure: Probabilistic Risk Assessment Flow. The recoverable labels are
listed below.]

End States (list of consequences of interest). Examples:
* Loss of life
* Loss of facility

Example initiating events:
* Fire
* Leak
* Malfunction
* External event

Operational and design inputs:
* Sequences of operation
* Timelines
* Operational Procedures
* Operational Rules/Assumptions
* Malfunction Procedures
* Training Manuals
* System Architecture
* Engineering Expertise
* P&IDs
* Human Error
* Common Cause

Engineering analysis is used to support success criteria, time, etc.

Data sources:
* Customer Data
* Industry Databases
  o OREDA
  o ICON
  o Well Master
* NPRD db
* EPRD db
* Other Assessments

Supporting analyses:
* Hazard Reports
* Functional Analyses
* FMEAs
* Previous risk assessments

Outputs:
* Risk Levels for Selected End States
* Relative Risk Drivers

Documentation of the PRA
supports a successful
independent review process
and long-term PRA application
PRA Development Process


[Figure: PRA development process flow. The recoverable panel titles and
notes are listed below.]

Defining the PRA Study Scope and Objectives

Initiating Events Identification

Event Sequence Diagram (Inductive Logic)
End states shown: OK, LOM, LOC

Event Tree (ET) Modeling

Fault Tree (FT) System Modeling
Legend: logic gate, basic event, link to another fault tree

Mapping of ET-defined Scenarios to Causal Events
One or more of these:
* Internal initiating events
* External initiating events
* Hardware failure
* Human error
* Software error
* Common cause failure
* Environmental conditions
* Other

Probabilistic Treatment of Basic Events
The uncertainty in occurrence frequency of an event
is characterized by a probability distribution.
Examples (from left to right):
* Probability that hardware x fails when needed
* Probability that the crew fails to perform a task
* Probability that there would be a windy condition at the time of landing

Model Logic and Data Analysis Review
Domain experts ensure that system failure logic
is correctly captured in the model and that appropriate data
is used in data analysis.

Model Integration and Quantification of Risk Scenarios
Integration and quantification of logic structures (ETs and FTs)
and propagation of epistemic uncertainties to obtain:
* minimal cutsets (risk scenarios in terms of basic events)
* likelihood of risk scenarios
* uncertainty in the likelihood estimates

Technical Review of Results and Interpretation

Communicating & Documenting
Risk Results and Insights to Decision-maker
* Displaying the results in tabular and graphical forms
* Ranking of risk scenarios
* Ranking of individual events (e.g., hardware failure,
  human errors, etc.)
* Insights into how various systems interact
* Tabulation of all the assumptions
* Identification of key parameters that greatly influence
  the results
* Presenting results of sensitivity studies
* Proposing candidate mitigation strategies
PRA Development Process (2)


* Define the scope of the PRA
  - Start with the end in mind, or the question you want answered. For
    example, loss of hydrocarbon containment and loss of life failure end
    states
  - Define the mission scope
  - Establish the mission/operational phases and lay out the mission level
    event trees and corresponding top events to be analyzed


* Develop logic models
  - Assign top events to system analysts for each subsystem and work with
    domain experts to develop fault trees
  - System analysts work with data analysts and domain experts to
    determine level of detail and failure logic (develop fault trees to the
    level that data exists)
  - Get appropriate project office concurrence on system models (fault
    trees)
PRA Development Process (3)


* Develop failure data into failure probabilities
  - Obtain specific failure history or best available generic data
  - Data analysts calculate failure probabilities based on best available
    data and approved methods


* Quantify the model, perform sanity checks, and iterate until the team
  is in agreement
  - Quantify the integrated model and perform sanity checks to determine
    which simplifying model assumptions need to be re-evaluated, where
    uncertainties need to be narrowed, and where additional deterministic
    analyses are needed


* Share results with program and projects
  - Risk ranking and risk insights
  - Incorporate feedback into PRA and into program/project design/ops
  - Maintain “Living PRA” to represent new program information (data
    updates) and evolving model scope
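
The data step above (generic data combined with specific failure history) is often handled with a Bayesian update. The following is a minimal sketch, assuming a gamma prior on a constant failure rate and a Poisson likelihood for the observed failures. The class name, prior values, and observed history are hypothetical and are not taken from this presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GammaRate:
    """Gamma distribution over a failure rate (failures per hour)."""

    shape: float  # acts like a pseudo-count of failures
    rate: float   # acts like a pseudo-exposure, in hours

    @property
    def mean(self) -> float:
        return self.shape / self.rate

    def update(self, failures: int, hours: float) -> "GammaRate":
        """Conjugate Poisson update: add observed failures and exposure."""
        return GammaRate(self.shape + failures, self.rate + hours)


# Hypothetical generic prior: mean of 1e-5 failures/hour, weighted like 0.5 failures.
prior = GammaRate(shape=0.5, rate=0.5 / 1.0e-5)

# Hypothetical system-specific history: 1 failure in 200,000 operating hours.
posterior = prior.update(failures=1, hours=2.0e5)

print(f"prior mean     = {prior.mean:.2e} per hour")
print(f"posterior mean = {posterior.mean:.2e} per hour")
```

With little specific data the posterior stays close to the generic prior, and as operating history accumulates the specific data dominates, which is the behavior a "Living PRA" relies on.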
Simple Example of a Small PRA model


* The spacecraft is designed with two redundant sets of thrusters
  (independent of each other)

* Each propellant distribution module consists of a hydrazine tank,
  filters, distribution lines, normally-open isolation valves, sensors,
  heaters, etc. (only components that affect mitigation of leaks are shown)

* When thruster operation is needed, the controller opens the solenoid
  valves (not shown) to allow hydrazine to flow

* The controller monitors the pressure of the feed lines via pressure
  transducers (P1 and P2). It is designed to differentiate between
  normal thruster operation and a leak

* In the event of a leak, isolation valves (V1 and V2) should both close

* Successful termination of the leak leads to the loss of one, but not
  both, thruster sets

* Failure to terminate the leak can cause damage to the flight critical
  avionics and/or damage to scientific equipment:
  - Hydrazine acts as a wire stripper and is corrosive

[Figure: Simplified Schematic of Propellant Distribution Module, showing
the pressure transducers, the isolation valves, and the feed to one set of
thrusters]
Example of Event Sequence Diagram (ESD)


[Figure: event sequence diagram starting from “Hydrazine leaks”, with
yes/no branches leading to “damage to scientific equipment” outcomes.]

Note: Better viewed as good things are up or to the right and bad things
are down (i.e. success is up or to the right and failure is down).

Note: The two “damage to scientific equipment” statements are made under
different conditions.
The ESD Translated Into an Event Tree


[Figure: event tree for the hydrazine leak. Column headings, left to right:
Hydrazine leaks | Leak not detected | Leak not isolated | damage to flight
critical avionics | damage to scientific equipment | End state]

Note: Better viewed as good things are up and bad things are down, i.e.
success up and failure down.


JSC S&MA Analysis Branch


Fault Trees Are Attached to the Event Tree


[Figure: fault tree for the event tree heading “Leak not detected” (LD),
with an OR gate at the top. Events shown: Controller fails; Common cause
failure of P transducers; Pressure transducer 1 fails; Pressure transducer 2
fails. The fault tree is attached to the matching heading of the event tree.]

PRA model embodies a collection of
various models (logic, reliability,
simulation and physical, etc.) in an
integrated structure
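
To show how the attached fault trees feed the event tree, here is a minimal sketch that quantifies the hydrazine leak example. All probabilities are hypothetical placeholders. The gate structure is an assumption: detection is assumed to fail if the controller fails, or a common cause failure takes out both transducers, or both transducers fail independently; isolation is assumed to fail if either isolation valve fails to close.

```python
def p_or(*probs: float) -> float:
    """Probability that at least one of several independent events occurs."""
    none_occur = 1.0
    for p in probs:
        none_occur *= 1.0 - p
    return 1.0 - none_occur


def p_and(*probs: float) -> float:
    """Probability that all of several independent events occur."""
    all_occur = 1.0
    for p in probs:
        all_occur *= p
    return all_occur


# Hypothetical basic-event probabilities (not from the presentation).
P_LEAK = 1.0e-3             # initiating event: hydrazine leaks
P_CONTROLLER = 1.0e-4       # controller fails
P_TRANSDUCER = 1.0e-3       # one pressure transducer fails
P_CCF_TRANSDUCERS = 4.7e-5  # common cause failure of both transducers
P_VALVE = 1.0e-3            # one isolation valve fails to close

# Fault tree for "Leak not detected" (assumed gate structure, see lead-in).
p_not_detected = p_or(
    P_CONTROLLER,
    P_CCF_TRANSDUCERS,
    p_and(P_TRANSDUCER, P_TRANSDUCER),
)

# Fault tree for "Leak not isolated" (assumed: either valve failing is enough).
p_not_isolated = p_or(P_VALVE, P_VALVE)

# Event tree: each sequence multiplies the branch probabilities along its path.
sequences = {
    "leak isolated (one thruster set lost)":
        P_LEAK * (1.0 - p_not_detected) * (1.0 - p_not_isolated),
    "detected but not isolated (damage)":
        P_LEAK * (1.0 - p_not_detected) * p_not_isolated,
    "not detected (damage)":
        P_LEAK * p_not_detected,
}

for name, probability in sequences.items():
    print(f"{name}: {probability:.2e}")
```

The sequence probabilities partition the initiating event, so they sum back to the leak probability; that is a useful sanity check on any event tree.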
Common Cause


* Definition of Common Cause Failure (CCF)
* Some basics
* Types of CCF models
* Examples of common cause
* Deriving common cause parameter values from data
* Examples of Betas calculated from real data (NASA and Nuclear)
* Conclusions
Common Cause Modeling


(More details and examples on this later)


* All large PRAs of complex and redundant machines must include
  “common cause” effects to be complete and accurate


* Common causes are those conditions that defeat the benefits of
  redundancy
  - Not “single point failures”
  - Similar to “generic cause”


* There are three recognized ways to perform common cause modeling:
  - The Beta Model
  - The Multiple Greek Letter Model
  - The Alpha Model


* We use an iterative approach to modeling common cause: first the
  Beta Model approach is used, and if common cause shows up as a risk
  driver, a Multiple Greek Letter Model is used


* Generic data from NUREG/CR-5485 are used for the majority of the events,
  since there are few cases where there is enough Shuttle data to develop
  Shuttle specific values
  - RCS Thrusters and ECO sensors are examples of cases where Shuttle
    specific data is used to calculate the common cause parameters
Common Cause Modeling (2)


HOW THE BETA MODEL APPROACH WORKS


* Susceptibility groups (groupings of similar or identical equipment) of
  redundant trains or components are identified


* A common cause basic event is defined for these groups


* The common cause basic event failure rate is generated by taking the
  independent failure rate times a “Beta” factor.
  - For the beta model it does not matter how many components are in the
    group
  - The “Beta” factor represents the probability of 2 or more failures
    given a failure has occurred
    > For this reason, the Beta Model may be conservative for component
      groups larger than 2.


* The “Beta” factor is taken from NUREG/CR-5485 and has a different
  value for “Operating” failures vs. “Demand” failures
  - For operating failures the “Beta” value is 0.0235
  - For demand failures the “Beta” value is 0.047
Common Cause Modeling (3)


HOW THE MULTIPLE GREEK MODEL APPROACH WORKS


* Similar to the Beta Model except that the Multiple Greek Model takes
  credit for the full redundancy and therefore can be much more complicated
  - For a 3 component group, there is a “beta” factor and a “gamma” factor,
    where the “beta” factor is still the probability of 2 or more failures
    given a failure has occurred, and the “gamma” factor is the probability
    of 3 or more failures given 2 or more failures.
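
The beta and gamma factors described above can be turned into basic-event probabilities. The sketch below uses the usual Multiple Greek Letter split for a group of three components; the function name and the gamma value are hypothetical, and the total failure probability is a placeholder.

```python
def mgl_three_component(q_total: float, beta: float, gamma: float) -> dict[int, float]:
    """Split one component's total failure probability into MGL terms.

    The value for key k is the probability of a basic event that fails one
    specific set of exactly k components in a group of three.
    """
    return {
        1: (1.0 - beta) * q_total,                # independent failure
        2: 0.5 * beta * (1.0 - gamma) * q_total,  # shared with one specific partner
        3: beta * gamma * q_total,                # all three fail together
    }


Q_TOTAL = 1.0e-3  # hypothetical total failure probability per component
BETA = 0.047      # demand value quoted from NUREG/CR-5485 in this deck
GAMMA = 0.5       # hypothetical, for illustration only

q = mgl_three_component(Q_TOTAL, BETA, GAMMA)

# A given component appears in one independent event, two pair events,
# and one triple event, so these must add back up to its total.
recombined = q[1] + 2.0 * q[2] + q[3]

print({k: f"{v:.2e}" for k, v in q.items()})
print(f"recombined total = {recombined:.2e}")
```

The recombination check is the design point: the Greek letters only redistribute a component's total failure probability among independent and shared events, they do not add to it.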
Common Cause Definition


* In PRA, Common Cause Failures (CCFs) are failures of two or
  more components, subsystems, or structures due to a single
  specific event which bypassed or invalidated redundancy or
  independence at the same time, or in a relatively short interval
  such as within a single mission
  - May be the result of a design error, installation error, or
    maintenance error, or due to some adverse common environment
  - Sometimes called a generic failure


* Common Cause, as used in PRA, is not a single failure that takes
  out multiple components such as a common power supply to
  computers or common fluid header to multiple pumps.
  - Single point failures such as these are modeled explicitly in a PRA
Some Basics on PRA and
Common Cause Failures


* PRA

  - PRA is used to perform “rare event” analysis
    > If we had 1000 Space Stations operating for 50 years each and we had
      lost 60 of them, we would not need to do a PRA to determine what the
      loss of station failure rate was
    > However, we have only had one Station operating for ~10 years with no
      loss of station, so methods like PRA are needed to estimate this value

  - Most of the components used in space vehicles are designed to have low
    failure rates, and the limited numbers of these components mean that an
    actual failure rate is difficult to calculate from operational data
    (uncertainty is high!)


* Common Cause Parameters

  - Beta is modeled as a fraction of the total failure rate.
    Total failure rate = independent failure rate + common cause failure rate
    Beta = common cause failure rate / total failure rate
    This is approximately equal to common cause failure rate / independent
    failure rate (when Beta is small)

  - If you have a low failure rate for a component, the common cause failure
    rate will be low too, but it could still have a high Beta factor

  - A failure rate is a rate, such as failures per hour. A failure
    probability is derived from the equation $1 - e^{-\lambda t}$, where
    $\lambda$ is the failure rate and $t$ is the exposure time. When
    $\lambda t$ is small, the equation can be simplified using the rare
    event approximation, giving failure probability $\approx \lambda t$.


Note: Beta is a parameter of a single modeling method. There are several
modeling methods and variations; most work in similar fashion.
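
The relationship between a failure rate and a failure probability, and the definition of Beta as a fraction of the total failure rate, can be checked numerically. This is a small sketch; the function names and the numbers are illustrative assumptions.

```python
import math


def failure_probability(rate_per_hour: float, hours: float) -> float:
    """Exact failure probability for a constant failure rate: 1 - exp(-rate * t)."""
    return -math.expm1(-rate_per_hour * hours)


def rare_event_approximation(rate_per_hour: float, hours: float) -> float:
    """Rare event approximation: probability is roughly rate * t when rate * t is small."""
    return rate_per_hour * hours


def beta_factor(independent_rate: float, common_cause_rate: float) -> float:
    """Beta as the common cause fraction of the total failure rate."""
    return common_cause_rate / (independent_rate + common_cause_rate)


# Hypothetical case with a small exposure: 1e-6 failures/hour over 1000 hours.
exact_small = failure_probability(1.0e-6, 1000.0)
approx_small = rare_event_approximation(1.0e-6, 1000.0)

# Hypothetical case where the approximation breaks down: rate * t = 1.
exact_large = failure_probability(1.0e-3, 1000.0)
approx_large = rare_event_approximation(1.0e-3, 1000.0)

print(f"small exposure: exact={exact_small:.6e}, approx={approx_small:.6e}")
print(f"large exposure: exact={exact_large:.3f}, approx={approx_large:.3f}")
print(f"beta = {beta_factor(9.53e-4, 4.7e-5):.3f}")
```

The approximation always overstates the exact value slightly, so it errs on the conservative side, but it should only be used while the product of rate and time stays well below 1.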


Types Of Common Cause Models


Common Cause is modeled as a conditional
probability, i.e. given that a component has failed, what
is the probability that another like component will fail?


Common models used are:


- Beta (β) model: For a system with multiple like
  components, the Beta factor is used to estimate the probability
  of failure of all components (i.e. two or more)

  - Values for Beta can range from 1 to 0.0001 (or less),
    but more typical values are usually between 0.1 and
    0.001


- Multiple Greek Letter (MGL) model: For systems with 3
  or more like components, provides for a more explicit
  breakdown of possibilities, probabilities of two, three,
  four, etc. component failures


- Alpha (α) model: Similar to the MGL model

Example Of Impact Of Modeling Common Cause


A system consisting of two trains:


[Figure: two fault trees side by side.]

Without considering common cause:
  failure of two paths = VALVE_A_FAILS (1.0E-3) AND VALVE_B_FAILS (1.0E-3)

Considering common cause (Beta model):
  failure of two paths = (VALVE_A_FAILS (1.0E-3) AND VALVE_B_FAILS (1.0E-3))
  OR common cause failure of the two paths (4.7E-5)


Ignoring common cause results in a ~4.7E-05 underestimate of risk. The risk
with common cause is 48 times the risk without considering common cause.

Example Of Impact Of Modeling Common Cause


A system consisting of three trains:


[Figure: two fault trees side by side.]

Without considering common cause:
  failure of three paths = VALVE_A_FAILS AND VALVE_B_FAILS AND VALVE_C_FAILS

Considering common cause (Beta model):
  failure of three paths = (VALVE_A_FAILS AND VALVE_B_FAILS AND
  VALVE_C_FAILS) OR common cause failure


Ignoring common cause results in a ~4.7E-05 underestimate of risk. The risk
with common cause is 47,000 times the risk without considering common cause.


Note: Using an MGL model would reduce the result to 2.6E-8
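
The two-train and three-train comparisons can be reproduced with a few lines. This sketch follows the Beta model as described in this deck (common cause term = Beta times the single-valve failure probability) and adds the terms with the rare event approximation; the function name is an assumption. The three-train case assumes the same 1.0E-3 valve failure probability as the two-train case.

```python
def redundant_failure(q: float, n_trains: int, beta: float = 0.0) -> float:
    """Failure probability of n redundant trains using the Beta model.

    All trains failing independently contributes q ** n_trains; the common
    cause basic event contributes beta * q regardless of group size.
    """
    independent = q ** n_trains
    common_cause = beta * q
    return independent + common_cause


Q_VALVE = 1.0e-3     # valve failure probability from the two-train example
BETA_DEMAND = 0.047  # demand Beta quoted from NUREG/CR-5485 in this deck

for trains in (2, 3):
    without_ccf = redundant_failure(Q_VALVE, trains)
    with_ccf = redundant_failure(Q_VALVE, trains, BETA_DEMAND)
    print(
        f"{trains} trains: without={without_ccf:.1e}, "
        f"with={with_ccf:.1e}, ratio={with_ccf / without_ccf:,.0f}"
    )
```

Because the common cause term does not shrink as trains are added, it dominates more strongly in the three-train case, which is why the Beta model can be conservative for groups larger than two.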


Types Of Data That Exist In The Models


* Functional: A functional failure event is generally defined as failure of
  a component type, such as a valve or pump, to perform its intended
  function. Functional failures are specified by a component type (e.g.,
  motor pump) and by a failure mode for the component type (e.g., fails to
  start). Functional failures are generally defined at the major component
  level such as Line Replaceable Unit (LRU) or Shop Replaceable Unit (SRU).
  Functional failures typically fall into two categories, time-based and
  demand-based. A Bayesian update is performed as Shuttle specific data
  becomes available.


* Phenomenological: Phenomenological events include non-functional events
  that are not solely based on equipment performance but on complex
  interactions between systems and their environment or other external
  factors or events. Phenomenological events can cover a broad range of
  failure scenarios, including leaks of flammable/explosive fluids, engine
  burn through, overpressurization, ascent debris, structural failure, and
  other similar situations.


* Human: Three types of human errors are generally included in fault trees:
  pre-initiating event, initiating event (or human-induced initiators), and
  post-initiating event interactions.


* Common Cause: Common Cause Failures (CCFs) are multiple failures of
  similar components within a system that occur within a specified period
  of time due to a shared cause.


* Conditional: A probability that is conditional upon another event, i.e.
  given that an event has already happened, what is the probability that
  successive events will fail?


Notional PRA Examples


First the Math


1.0E-02 = 0.01 => 1:100 (Probable) => ~Shuttle Mission Risk

1.0E-06 = 0.000001 => 1:1,000,000 (Improbable) => having 20 coins
simultaneously landing on tails

1.0E-12 = 0.000000000001 => 1:1,000,000,000,000 (ridiculous)
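
The conversions above are easy to reproduce. This is a small sketch with a hypothetical helper name; it also checks the coin comparison, since 20 fair coins all landing on tails has probability 0.5 raised to the 20th power.

```python
def one_in(probability: float) -> str:
    """Express a probability as '1:N' odds, rounded to the nearest integer."""
    return f"1:{round(1.0 / probability):,}"


twenty_tails = 0.5 ** 20  # probability that 20 fair coins all land on tails

print(one_in(1.0e-2))
print(one_in(1.0e-6))
print(one_in(twenty_tails), f"({twenty_tails:.2e})")
```

Twenty tails in a row comes out slightly rarer than one in a million, so the comparison on the slide is a close approximation rather than an exact match.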
Time Perspective


[Figure: timeline of past events expressed in hours. Recoverable labels:]

Hours ago                    Years ago
1.2 x 10^14                  ~14 billion
~3.9 x 10^13                 ~4.5 billion
2 x 10^12 to 7 x 10^11       ~228 to 80 million
4 x 10^8                     ~46,000
2.1 x 10^6                   ~240
6.3 x 10^5                   ~72
Risk Regression Example


[Figure: calculated risk (vertical axis, 0 to 0.1, with 1:10 marked) versus
Flight Sequence # (horizontal axis, 10 to 130), annotated with design
changes from Design Change #1 to Design Change #13.]


This chart shows how calculated risk changed following design and ops
changes over a 30 year program by peeling back the “onion” (starting at the
end and undoing changes). Note that risk doesn’t decrease according to a
nice exponential curve, but only after something fails and it gets “fixed”.
Uncertainty Distribution


* This distribution is a representation of the uncertainty associated with
  a PRA’s results
* The median is also referred to as the 50th percentile


[Figure: uncertainty distribution of the PRA result; horizontal axis:
Probability, from 4.0E-03 to 2.4E-02. Marked points:]

5th percentile:  7.9E-03 (1:127)
Median:          1.1E-02 (1:94)
Mean:            1.1E-02 (1:90)
95th percentile: 1.6E-02 (1:63)


* The 5th and 95th percentiles are common points on a distribution to show
  the range within which 90% of the estimated risk lies.

* The mean is a common measure of risk that accounts for the uncertainty of
  this distribution, and is thus the value or metric used to verify LOC
  requirements.
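
A distribution like this one is typically produced by sampling the uncertain basic-event probabilities and re-quantifying the model many times. The sketch below is a minimal Monte Carlo illustration under stated assumptions: the three basic events, their lognormal medians and error factors, and the tiny model "(A and B) or C" are all hypothetical and do not reproduce the values above.

```python
import math
import random
import statistics


def lognormal_sampler(median: float, error_factor: float, rng: random.Random):
    """Build a sampler for a lognormal given its median and error factor.

    The error factor is taken here as the 95th percentile divided by the median.
    """
    mu = math.log(median)
    sigma = math.log(error_factor) / 1.645  # 1.645 is the 95th percentile z-score
    return lambda: min(1.0, rng.lognormvariate(mu, sigma))


def percentile(sorted_values: list[float], q: float) -> float:
    """Simple index-based percentile on an already sorted list."""
    return sorted_values[int(q * (len(sorted_values) - 1))]


rng = random.Random(2016)  # fixed seed so the run is repeatable

# Hypothetical basic events: (median probability, error factor).
sample_a = lognormal_sampler(1.0e-3, 3.0, rng)
sample_b = lognormal_sampler(2.0e-3, 3.0, rng)
sample_c = lognormal_sampler(5.0e-5, 10.0, rng)

# Hypothetical model: the top event occurs if (A and B) or C,
# quantified with the rare event approximation.
results = sorted(sample_a() * sample_b() + sample_c() for _ in range(20_000))

mean = statistics.fmean(results)
p05, median, p95 = (percentile(results, q) for q in (0.05, 0.50, 0.95))

print(f"5th={p05:.2e}  median={median:.2e}  mean={mean:.2e}  95th={p95:.2e}")
```

For right-skewed inputs like these, the mean usually sits above the median because it is pulled up by the upper tail, which is one reason the mean is the more cautious single number to compare against a requirement.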
Notional Ascent Risk Profile
(not a direct output of PRA)


[Figure: notional ascent risk profile. Series: Normalized Risk and
Cumulative Risk; vertical axis from 0.00 to 9.00; marked events: Liftoff,
SRB Separation, ET Separation.]

This chart builds off of PRA results (that are time averaged over
ascent), and thus requires post-processing to get this profile.

[Figure legend: System 1, System 2, Human Error, Conditional Failure]
Showing Uncertainty wrt Requirements


[Figure: notional chart of risk estimates with their uncertainty ranges
plotted against requirement values on a logarithmic axis from 1/10000 to
1/10.]

Green bar shows the requirement value is met
Red bar shows the requirement value is not met
Notional Risk Drivers via Pareto


(Top 80% of Calculated Risk)


A Pareto chart like this can be made for each project, rig, platform, etc.


[Figure: Pareto chart of various subsystems and scenarios, showing each
one's share of the total “1 in xxx” risk; axis: % of Risk.]
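
A Pareto cut like "top 80% of calculated risk" can be produced by ranking contributors and accumulating their shares. This is a minimal sketch; the scenario names and risk values are hypothetical.

```python
def pareto_drivers(
    risks: dict[str, float], cutoff: float = 0.80
) -> list[tuple[str, float, float]]:
    """Return (name, share, cumulative share) for the top contributors.

    Contributors are added in descending order until the cumulative
    share of total risk reaches the cutoff.
    """
    total = sum(risks.values())
    ranked = sorted(risks.items(), key=lambda item: item[1], reverse=True)
    drivers = []
    cumulative = 0.0
    for name, value in ranked:
        if cumulative >= cutoff:
            break
        share = value / total
        cumulative += share
        drivers.append((name, share, cumulative))
    return drivers


# Hypothetical scenario risks (probability per mission).
scenario_risk = {
    "Scenario A": 4.0e-4,
    "Scenario B": 3.0e-4,
    "Scenario C": 1.5e-4,
    "Scenario D": 1.0e-4,
    "Scenario E": 5.0e-5,
}

for name, share, cumulative in pareto_drivers(scenario_risk):
    print(f"{name}: {share:6.1%} of risk, {cumulative:6.1%} cumulative")
```

The contributor that crosses the cutoff is included, so the listed drivers always cover at least the requested fraction of total risk.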
Integrated Risk-Informed Design Assessment


Risk Trade Study for Proposed Change
- Baseline from PRA and Achievability Study
- Addresses the three scenarios where the design change can reduce risk,
  and the additional risk associated with inadvertent operation
- Result is an increase in probability of LOM due to inadvertent ops with
  no LOC advantage

Bottom Line
- 20% of mission risk is due to ABC (#1 Risk) and CBA (#2 Risk)
- Risk addressed by the proposed change is insignificant
- Recommend spending resources on top risk drivers

[Figure: System Risk Drivers. Baseline scenarios are #54 in Risk, < 1%.]

[Figure: Facility Risk Drivers. Bars include MPCV LRS Fails - EDL, MPCV -
Unsuccessful..., SLS SRB Breach, MPCV Software Failure, and Medical.
Baseline scenarios are #234 in Risk, < 0.2%.]

May 2016

When Should You Do a PRA?


* As early in the design process as you can in order to
  affect the design and corresponding risk with
  minimal cost impact (i.e. to support Risk Informed
  Design (RID))


* When the risk of losing the project is greater than
  the company can live with, either due to loss of life or
  for environmental or economic reasons


* To support Risk Informed Decision Making (RIDM)
  throughout a project’s life cycle from “formulation to
  implementation” or “concept to decommissioning”
How much does a PRA cost?


* You can also ask, “How much will it cost to not
  do a PRA?”


* The cost of a PRA is a function of the level of detail
  desired as well as the size/complexity of the item
  being assessed and the mission life cycle

  - You should only model to the level of detail for which you have data
    and no further. If you identify that significant risk exists at a
    sublevel, then your PRA is telling you that you need to study that
    level further. That study may not be a PRA, but a reliability
    assessment.

  - Modeling a drilling rig is on a different scale than modeling just
    the BOP. However, understanding the need for a BOP can be important in
    its design and operation.
Absolute vs Relative Risk?


* You may have heard, “Don’t believe the absolute risk estimate,
  just the relative ranking”.


* Each event in a PRA is assessed as having a probability of
  failure (since the PRA is performed in “failure space”).

  - These failures are combined via the failure logic, which determines
    how they combine and the resulting scenarios.

  - The failure probabilities of each event are used to establish the
    probability of each scenario, which ranks the scenarios. The scenario
    probabilities are also added to produce the overall risk.

  - If different approaches and methods are used (which are sometimes
    needed in full scope PRAs), then the absolutes can be challenged and
    so may their rankings. This is where experienced PRA analysts earn
    their pay to help minimize the difference.


* As a result, some decision makers or risk takers want to know
  the overall risk, while others want to know how to reduce it by
  working on the top risk drivers first.
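
The link between event probabilities, scenario ranking, and overall risk can be sketched with minimal cut sets. In this illustration each scenario is a cut set whose probability is the product of its basic events, the overall risk is their sum (rare event approximation), and individual events are ranked with a Fussell-Vesely style importance. The event names and probabilities are hypothetical.

```python
from math import prod

# Hypothetical basic-event probabilities.
basic_events = {
    "VALVE_A": 1.0e-3,
    "VALVE_B": 1.0e-3,
    "CCF_VALVES": 4.7e-5,
    "CONTROLLER": 1.0e-4,
}

# Hypothetical minimal cut sets: each tuple is one risk scenario.
cut_sets = [
    ("VALVE_A", "VALVE_B"),
    ("CCF_VALVES",),
    ("CONTROLLER",),
]


def cut_set_probability(cut_set: tuple[str, ...]) -> float:
    """Probability of a scenario: all of its basic events occur."""
    return prod(basic_events[event] for event in cut_set)


def importance(event: str, total: float) -> float:
    """Fraction of total risk from the cut sets that contain the event."""
    contribution = sum(
        cut_set_probability(cs) for cs in cut_sets if event in cs
    )
    return contribution / total


# Overall risk by the rare event approximation: sum of cut set probabilities.
total_risk = sum(cut_set_probability(cs) for cs in cut_sets)
ranked_scenarios = sorted(cut_sets, key=cut_set_probability, reverse=True)

print(f"overall risk = {total_risk:.2e}")
for cs in ranked_scenarios:
    print(f"  {' AND '.join(cs)}: {cut_set_probability(cs):.2e}")
for event in basic_events:
    print(f"  importance of {event}: {importance(event, total_risk):.1%}")
```

If every basic-event probability were scaled by a common factor, the absolute risk would change but single-event scenarios would keep their relative order, which is part of why relative rankings are often trusted more than the absolute number.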
Unknown and Underappreciated Risks


* Risk model completeness has long been recognized as a
  challenge for simulated methods of risk analysis such as PRA as
  traditionally practiced.


* These methods are generally effective at identifying system
  failures that result from combinations of component failures that
  propagate through the system due to the functional dependencies of
  the system that are represented in the risk model.


* However, they are typically ineffective at identifying system failures
  that result from unknown or underappreciated (UU) risks,
  frequently involving complex intra- and inter-system interactions that
  may have little to do with the intentionally engineered functional
  relationships of the system.
Unknown and Underappreciated Risks
(Cont'd)


* Earlier in 2009, the NASA Advisory Council noted the following set of
  contributory factors:
  - Inadequate definitions prior to agency budget decision and to external
    commitments
  - Optimistic cost estimates/estimating errors
  - Inability to execute initial schedule baseline
  - Inadequate risk assessments
  - Higher technical complexity of projects than anticipated
  - Changes in scope (design/content)
  - Inadequate assessment of impacts of schedule changes on cost
  - Annual funding instability
  - Eroding in-house technical expertise
  - Poor tracking of contractor requirements against plans
  - Reserve position adequacy
  - Lack of probabilistic estimating
  - “Go as you can afford” approach
  - Lack of formal document for recording key technical, schedule, and
    programmatic assumptions
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Why Do PRA? 


° What does a PRA tell you? 


— In a large percentage of cases, the PRA tells you, or confirms for you, what
you thought you already knew
> What it also does in these cases is document in a meaningful way why you
thought this was true
> PRAs connect design, logic, operations, human interaction and
external influences for all aspects of large complex machines to detect
dependencies and effects that the human mind just could not track and grasp on
its own
— In a small percentage of cases, the PRA results show something significant
that you didn’t know
> In these cases you may have a false sense of understanding and in fact the PRA
has pointed out something that has been overlooked, or
> your understanding is correct and there is a problem with the way something is modeled
in the PRA


° What does performing the PRA tell you? 


— PRAs are recognized as tools that have enhanced the understanding
between operations and engineers as to how the equipment really works, is 
used, and fails by promoting communication across disciplines and 
organizations. 


— It also gives a framework for resolving problems and failures. 
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Why Do PRA? (Cont’d) 


° PRAs are used to model and quantify rare events 
— If we had 100,000 space stations operating for 40 years each, with
catastrophic failure of 500 of them, we could do
pretty standard statistics to estimate the probability of
catastrophic failure of a space station.
> However, we have only one space station and it has had 
minimal experience and no catastrophic failures. Therefore, 
there will rarely be any statistically significant data since it is in 
rare event territory. 
> PRA takes into account external events 
= Micro-meteoroid and orbital debris (MMOD) 
= Fire, etc. 
> PRA takes into account Human Error and Common Cause 
> PRA links functional dependency of systems and operations 
> PRA performs uncertainty analysis 
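
A minimal sketch of the "pretty standard statistics" mentioned above, using the hypothetical fleet from the slide. The function name `failure_fraction` and the use of a simple binomial standard error are illustrative assumptions, not part of the original material.

```python
import math

def failure_fraction(failures, units):
    """Return the observed failure fraction and its binomial standard error."""
    p = failures / units
    std_err = math.sqrt(p * (1.0 - p) / units)
    return p, std_err

# Hypothetical fleet from the slide: 100,000 stations, 500 catastrophic failures.
p_fleet, se_fleet = failure_fraction(500, 100_000)
print(f"fleet estimate: {p_fleet:.4f} +/- {se_fleet:.5f} per 40-year station life")

# Actual situation: one station and no catastrophic failures.
p_one, se_one = failure_fraction(0, 1)
print(f"single-station estimate: {p_one} +/- {se_one}")
```

With one station and zero failures the direct estimate collapses to zero with no usable spread, which is why the probability has to be inferred from a model instead.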
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In Closing S& 


¢ There is much more to know about PRA than what you’ve seen 
today. This presentation was to give you insight in order to ask 
the right questions when you are trying to decide: 
o whether you need a PRA or not, 
o Is it being performed properly and by qualified analysts, 
o Is it answering the question(s) you need answered. 


¢ PRA (with the help of deterministic analyses) identifies and ranks
the risk contributors; the FMEA analysts and Reliability Engineers
can then help solve the problem by focusing on the top risk drivers.
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Backup Charts S& 


Some Background 


° In late fifties / early sixties Boeing and Bell Labs developed Fault Trees 
to evaluate launch systems for nuclear weapons and early approaches 
to human reliability analysis began 


° NASA experimented with Fault Trees and some early attempts to do
Probabilistic Risk Assessment (PRA) in the sixties (most notably on the
Apollo Program) but then abandoned it and reduced quantitative risk
assessment


° Nuclear power industry picked up the technology in early seventies 
and created WASH-1400 (Reactor Safety Study) in mid seventies. 


— This is considered the first modern PRA 


— Was shelved until Three Mile Island (TMI) incident happened in 1979. It was 
determined that the WASH-1400 study gave insights to the incident that could not 
be easily gained by any other means. 


° PRA is now practiced by all commercial nuclear plants in the United
States and a large amount of data, methodology and documentation
for PRA technology has been developed by the industry and the
Nuclear Regulatory Commission (NRC)


— All new Nuclear Plants must license their plants based on PRA as well as 
“Defense In Depth” concepts. 


— The NRC practices its oversight responsibility of the commercial nuclear industry 
using a “Risk” based approach that is heavily dependent on PRA. 
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Acronyms and Definitions nsal 


1. Cut set: Those combinations of items that can cause a failure of the type that 
you are interested in. A “minimum cutset” is the minimum combination of 
items necessary to cause the failure of interest. 


2. End State: The consequence of interest that is defined for what your model is
supposed to calculate (sometimes will be referred to as a Top event or Figure
of merit depending on model type).


3. Top event (Top): The top event in a fault tree or a pivotal event in an event tree.
If an event tree uses a linked fault tree to calculate a pivotal event then the
pivotal event name and Fault tree “Top” name need to be identical.


4. MLD: Master Logic Diagram. Used to identify all possible initiators.


5. Event Tree: A logic tool that is used to model inductive logic and quantify
models using Boolean logic. Can be linked to other event trees and can use
fault trees linked to it.


6. Fault Tree: A logic tool that is used to build deductive models of systems or
processes and is quantified with Boolean logic. Can be linked to Event Trees
or a linked fault tree model. Built from top down and quantified from bottom
up.

7. PRA: Probabilistic Risk Assessment: A technique used for evaluating rare 
events for complex systems or processes. Attempts to account for all possible 
events that can cause the “end state”, “Top event”, “Figure of Merit”. Uses 
fault trees, event trees and other methods to “infer” the probability of events of 
interest. Better definition later. 


8. Rare Event: An event that has a small probability of happening. From a data 
point of view, it will have never been seen in practice or seen only rarely. It will 
not have enough data to be statistically significant. From the “rare event
approximation” point of view it is a probability that is 0.1 or less.
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Acronyms and Definitions al 


(continued) 


9. LOC: Loss of Crew: A common “end state”, “top event” consequence, or 
“Figure of Merit” that we are interested in at NASA. 


10. LOM: Loss of Mission: A common “end state”, “top event”, consequence,
or “Figure of Merit” that we are interested in at NASA. 


11. Risk: Probability or Frequency, times consequences 


12. “And” gate: A logic symbol used in Fault Trees that multiplies inputs to it.
In Boolean algebra it defines the “intersection” of events that are put into it. 


13. “Or” gate: A logic symbol used in Fault Trees that adds inputs to it. More
accurately, in Boolean algebra it is the “union” of events that are put into it.


14. Bathtub Curve: This is a curve shaped like a bathtub that represents infant
mortality or break-in failures early in a component's or system's life and
wear-out or aging late in life, with a relatively constant or flat line connecting them.
The flat line or constant failure rate implies that failures are random and
independent of time.


15. Infant mortality: The portion on the bathtub curve that is on the front end 
showing that failure rates are improving (becoming smaller) as time 
increases. 


16. Aging: The Portion on the Bathtub curve that is on the back end that shows 
the failure rates increasing as components wear out or age. 


17. Exponential Distribution: This is the distribution or equation that we use to 
represent the flat part of the bathtub curve (constant failure rate) and our 
PRA models that rely on the failure rates being random with respect to time. 
For reliability it is $e^{-\lambda t}$ and in failure space it is $1 - e^{-\lambda t}$.


JSC S&MA Analysis Branch 
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Acronyms and Definitions al 


(continued) 


18. Time Rate of Failure: Failures that are defined as a rate of failure per time 
interval (e.g. failures per hour) 


19. Demand Failure: Failures that are defined as a failure per demand. 


20. Conditional Probability: This is a probability of occurrence that is
pre-conditioned on a specific set of circumstances that precedes it or is
concurrent with it.


21. Frequency: This is a rate (usually per time but can be defined per other
parameters such as demands etc.). This is a number greater than 0 but not
necessarily less than 1.


22. Probability: Dimensionless number between 0 and 1. Describes the 
likelihood of something happening. 


23. Minimal Cutset: A “minimum cutset” is the minimum combination of items
necessary to cause the failure of interest. 


24. ESD: Event Sequence Diagram: This is a tool sometimes used to help 
explain the flow of an event or events and can be directly represented by an 
event tree. It uses inductive logic. Relatively few computer software 
programs will quantify ESDs. 


25. Lambda: This is a rate of failure. Often uses the Greek symbol λ. Most of
the time this will be a time rate of failure but can also be used to represent a
“demand rate of failure”.


26. λ: Greek letter lambda, often used to show a failure rate.


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” Acronyms and Definitions & 


(continued)


27. Lognormal Distribution: This is a distribution of events that, if graphed on log
paper, would show a normal distribution. It is a distribution often used in the
PRA world to define the uncertainty of lambda (λ).


28. EF (Error Factor): This is a parameter used to help define the width of a lognormal
distribution. It is defined as 95th/50th = 50th/5th = square root of (95th/5th).

We will oftentimes approximate a result of an uncertainty evaluation with a
lognormal distribution when it is in fact not a lognormal or any other kind of
distribution but a lognormal does a good job of approximating it. In such cases
we always try to use the definition EF = square root of (95th/5th).


29. Fussell-Vesely (FV): Fussell-Vesely importance measure. Represents how much
of a component's failure is contributing to the Top event or end state. It is often
expressed as a percentage, although it is not really one; this will be covered later.


30. Risk Increase Ratio (RIR): This is another importance measure that will tell you
how much a Top Event or End State will increase if you set an item's probability of
failure to 1 and recalculate the end state or top event. It is equivalent to RAW.


31. Risk Achievement Worth (RAW): This is another importance measure that will tell
you how much a Top Event or End State will increase if you set an item's
probability of failure to 1 and recalculate the end state or top event. It is equivalent
to RIR.
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” Acronyms and Definitions & 


(continued)


32. Risk Reduction Ratio (RRR): This is another importance measure
that will tell you how much a Top Event or End State will decrease if
you set an item's probability of failure to 0 and recalculate the end
state or top event. It is equivalent to RRW.


33. Risk Reduction Worth (RRW): This is another importance measure
that will tell you how much a Top Event or End State will decrease if
you set an item's probability of failure to 0 and recalculate the end
state or top event. It is equivalent to RRR.


34. Common Cause Failure (CCF): This is a failure cause that can 
result in multiple failures of identical redundant equipment within a 
short time span therefore reducing the advantage of having 
redundant equipment. (e.g. contaminated lube oil fails multiple 
pumps in a redundant system). 


35. Big Stew (BS) extra credit: This is a method defined by the
incredibly brilliant Mark Bigler and Mike Stewart in order to model 
inter-phase dependencies using a linked fault tree model. The only 
reason Bigler is allowed to have top billing is so we can get a good 
and memorable Acronym (BS). It is also okay to consider the Big in 
“Big Stew” to be a modifier of Stew. 
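
The cut set and importance measure definitions above (FV, RAW/RIR, RRW/RRR) can be illustrated with a small sketch. Everything here is hypothetical: the two cut sets, the basic event names and probabilities, and the helper names `top_probability` and `importance`. The top event is quantified with the minimal cut set upper bound, and the measures use the common ratio forms; a real tool may define or report them differently.

```python
import math

def top_probability(cut_sets, probs):
    """Minimal cut set upper bound: 1 - product of (1 - P(cut set))."""
    survive = 1.0
    for cut_set in cut_sets:
        survive *= 1.0 - math.prod(probs[event] for event in cut_set)
    return 1.0 - survive

def importance(cut_sets, probs, event):
    """Recalculate the top event with the item set to 0 and to 1."""
    base = top_probability(cut_sets, probs)
    p_zero = top_probability(cut_sets, {**probs, event: 0.0})
    p_one = top_probability(cut_sets, {**probs, event: 1.0})
    return {
        "FV": (base - p_zero) / base,                      # fraction of risk involving the item
        "RAW": p_one / base,                               # increase if the item always fails
        "RRW": base / p_zero if p_zero > 0 else math.inf,  # decrease if the item never fails
    }

# Hypothetical model: A alone fails the system, or B and C together.
cut_sets = [{"A"}, {"B", "C"}]
probs = {"A": 1e-3, "B": 1e-2, "C": 2e-2}

base = top_probability(cut_sets, probs)
imp_a = importance(cut_sets, probs, "A")
print(f"top event = {base:.4e}")
print({name: round(value, 3) for name, value in imp_a.items()})
```

In this toy model the single-event cut set A dominates, so it shows both a high FV and a large RAW.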
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Basic Probability Info 


Some fundamental information about 
different ways we use failure information 


May 2016


Bathtub Curve


[Figure: bathtub curve with three regions: infant mortality, constant failure rate, and wear-out or age related. The constant failure rate region is where we operate as far as our model is concerned.]
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Distribution Model is Based On 


* Our PRA model is based on the Exponential 
Distribution 


In reliability space: $P_s = e^{-\lambda t}$

In failure space: $P_f = 1 - e^{-\lambda t}$

For small values of $\lambda t$, $P_f \approx \lambda t$ (Rare event Approximation)

$\lambda$ is constant (i.e. we are on the bottom of the bathtub curve)


* Do not confuse this with the uncertainty distribution
that we give to $\lambda$.
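
As a rough illustration of the exponential model and the rare event approximation, the sketch below compares the exact failure probability with the approximation for a few values of lambda times t. The function names are illustrative only.

```python
import math

def reliability(lam, t):
    """Probability of success over time t at constant failure rate lam."""
    return math.exp(-lam * t)

def failure_probability(lam, t):
    """Exact failure probability for the exponential model."""
    return 1.0 - math.exp(-lam * t)

def rare_event(lam, t):
    """Rare event approximation: valid when lam * t is small."""
    return lam * t

for lam_t in (0.001, 0.01, 0.1, 0.5):
    exact = failure_probability(lam_t, 1.0)
    approx = rare_event(lam_t, 1.0)
    print(f"lambda*t={lam_t:<6} exact={exact:.5f} approx={approx:.5f}")
```

The approximation is always at or above the exact value, and the gap grows once lambda times t passes roughly 0.1.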
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Demand Failures vasa 


We have discussed time rate of failures (see previous page) 


° When items are shut down and started they need to be modeled
with a failure to start, or some items fail to work when called on.
These are called “Demand” failures
° We can use a demand failure rate, which we define as $\lambda_d$, and can
estimate a failure probability by taking this “failure rate” (we call it
a rate but it is not specifically a rate of time but a rate of
demands) and multiplying it by the number of demands (D).
— Probability of Failure = $\lambda_d \times D$ as long as this value is relatively small
— We can write an equation similar to a time rate of failure probability:
> $P_f = 1 - e^{-\lambda_d D}$
° HRA, valves failing to open on demand or close on demand, or 
motors failing to start on demand etc. are demand failures and 
should be modeled with demand failure rates not time failure 
rates. In many cases a motor needs to have two failures 
modeled 
— A failure to start on demand 
— A failure to continue running 
° This is true of standby equipment that is redundant that is not 
running and needs to be started to fulfill its safety function. 
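
A minimal sketch of the two-failure-mode motor model described above, assuming one start demand followed by a run period. The numerical values and the name `standby_failure_probability` are hypothetical.

```python
import math

def standby_failure_probability(p_start, lam_run, hours):
    """Probability that a standby unit fails to start or fails to keep running."""
    p_run = 1.0 - math.exp(-lam_run * hours)  # time-based failure to continue running
    # "Or" of two independent failure modes, computed in success space.
    return 1.0 - (1.0 - p_start) * (1.0 - p_run)

# Hypothetical pump: 3E-3 per demand to start, 1E-5 per hour to run, 100-hour mission.
p_pump = standby_failure_probability(p_start=3e-3, lam_run=1e-5, hours=100.0)
print(f"{p_pump:.6f}")
```

Here the start demand contributes about three times as much as the run time, which is why both modes need to be in the model.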
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Conditional Probabilities vasa 


° Based on a condition that has been established (B) 
what is the probability of a particular event (A) 
happening or Given B what is the probability of A 

— Written as P(A|B)


* Example: Given that a tire has blown what is the 
probability that the landing gear will collapse? 


* In principle the probabilities given in succeeding 
nodes on a path through an event tree are 
conditioned on what has happened before. 


— So a node could have different probabilities based on what has
happened prior or which path it is on. 
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® Conditional Probability Continued vasa 


/B = 1 - B is called the “complement” of B and can be
written in different formats
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Frequencies vs Probabilities 


°* Technically you could argue probabilities are frequencies although they are defined 
as dimensionless (also a probability has to be between 0 and 1 and a frequency can 
be larger than 1) 


> That is, we need to ensure that if our initiator is a probability of failure per six months of operation (this is a
frequency), the mission time in our probability calcs for the pivotal events is also six months


— We have a failure rate (a frequency) that is multiplied by a time period (mission time) and if we use
rare event we get the following:
> If $\lambda$ = 1E-5 per hour and mission time = 1000 hours then the probability = 1E-2 (or lambda x mission
time)
> However, we still need to remember that this is the probability (a dimensionless number) for an event 
happening in a 1000 hour time frame 
° For Space Station we always do our calcs for a mission time 


° Even demand failures are a rate of sorts (failure per demand). The number of 
demands is dictated by the number of demands that are expected per cycle or per 
six month period of time etc. 


° Typically the front of an event tree (the initiator) is a frequency (that is why it is
treated differently in SAPHIRE). Probabilities all have to be between 0 and 1; a
frequency does not.

— We could have a frequency of initiation of 10 losses of a system per year in some analysis. If this 
frequency is small (less than one) it can often times be treated like a probability but it still carries a 
per hour or per demand etc. value 

° In practice we often use probabilities and frequencies interchangeably, and as long
as we keep track of what we mean it is okay (probably careless and sloppy), but we
can’t confuse them.

— By definition the outcome or end state ends up being a frequency (the initiator, which is a frequency,
times all the pivotal events, which are probabilities).


— So when we do our event trees we need to ensure the mission time matches what our initiator
frequency is
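
The bookkeeping above can be sketched as follows: an initiator frequency is multiplied by pivotal event probabilities evaluated over the same period. The initiator value of 0.5 per 1000-hour mission and the conditional probability of 0.1 are hypothetical; the 1E-5 per hour rate and 1000-hour mission time are the figures used on the slide.

```python
def sequence_frequency(initiator_frequency, pivotal_probabilities):
    """End state frequency = initiator frequency x pivotal event probabilities."""
    frequency = initiator_frequency
    for p in pivotal_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("pivotal events must be probabilities between 0 and 1")
        frequency *= p
    return frequency

mission_hours = 1000.0
initiator = 0.5                    # hypothetical: initiators per 1000-hour mission
p_pivotal_1 = 1e-5 * mission_hours # rare event: lambda x mission time = 1E-2
p_pivotal_2 = 0.1                  # hypothetical conditional probability

end_state = sequence_frequency(initiator, [p_pivotal_1, p_pivotal_2])
print(f"{end_state:.1e} per {mission_hours:.0f}-hour mission")
```

The result still carries the initiator's units (per mission), which is the point the slide makes about keeping frequencies and probabilities straight.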
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Quick and Dirty Calcs asa 


°* When you need to do a calc fast (in a 
meeting or to check a more major calc) 


— Use rare event if appropriate (that is the time rate of 
failure or the demand rate of failure times their 
respective mission times or number of demands do 
not exceed ten percent) 

> Even here to do a quick check or sanity check using rare 
event will give you a conservative upper bound even if you 
exceed the 10% value 

— Sometimes it is easier to do the calc in reliability 
space than in failure space and then convert back 

> Remember 
probability of failure = 1 - probability of success 
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Examples of easy calcs asa 




°* Probability of failure of tethering is 1E-3 per tether 
attempt, there are 400 estimated tethers in the next 5 
years. What is the probability over 5 years that we 
fail to tether? 
— Build a fault tree with 400 basic events of failure to tether going
through an “or” gate (not easy)
— Solve using a binomial distribution (not easy, for me anyway)


— Solve using rare event: 400 x 1E-3 = 0.4 (this is above the 10%
value for use of rare event but gives a conservative upper bound
estimate)


— Solve using success space: probability of success is
$(1 - 0.001)^{400} \approx 0.67$ so probability of failure is 1 - 0.67 = 0.33
— Use $1 - e^{-\lambda D}$, where lambda is the demand failure rate and D is the number of demands, to get 0.33
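
The three hand calculations for the tether example can be checked with a few lines. The variable names are illustrative; the inputs are the ones given on the slide.

```python
import math

p_fail = 1e-3   # probability of failure per tether attempt
attempts = 400  # estimated tethers over 5 years

rare = p_fail * attempts                      # rare event upper bound
exact = 1.0 - (1.0 - p_fail) ** attempts      # success space calculation
expo = 1.0 - math.exp(-p_fail * attempts)     # 1 - e^(-lambda * D)

print(f"rare event:    {rare:.2f}")
print(f"success space: {exact:.2f}")
print(f"exponential:   {expo:.2f}")
```

The success space and exponential answers agree to two decimal places, while the rare event figure bounds both from above.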
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Data Analysis 


DATA ANALYSIS
TYPES OF DATA THAT EXIST IN THE MODELS 
° Functional — A functional failure event is generally defined as failure of a
component type, such as a valve or pump, to perform its intended function.
Functional failures are specified by a component type (e.g., motor pump) and by
a failure mode for the component type (e.g., fails to start). Functional failures are
generally defined at the major component level such as Line Replaceable Unit
(LRU) or Shop Replaceable Unit (SRU). Functional failures typically fall into two
categories, time-based and demand-based. Generic values are Bayesian updated as
Shuttle specific data becomes available.


° Phenomenological — Phenomenological events include non-functional events 
that are not solely based on equipment performance but on complex interactions 
between systems and their environment or other external factors or events. 
Phenomenological events can cover a broad range of failure scenarios, including 
leaks of flammable/explosive fluids, engine burn through, overpressurization, 
ascent debris, structural failure, and other similar situations. 


* Human — Three types of human errors are generally included in fault trees: pre- 
initiating event, initiating event (or human-induced initiators), and post-initiating 
event interactions. 
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° Common Cause — Common Cause Failures (CCFs) are multiple failures of similar 
components within a system that occur within a specified period of time due to a
shared cause. 


° Conditional — A probability that is conditional upon another event, i.e. given that
an event has already happened, what is the probability that successive events
will fail
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FUNCTIONAL DATA ANALYSIS (2) vasa 


DATA SOURCES 
¢ NASA’s PRACA databases are sources for Shuttle specific failure data
¢ Prime contractor data, when available


¢ Nonelectronic Parts Reliability Data (NPRD) is a generic data source
for run time failure data for mechanical components


¢ Electronic Parts Reliability Data (EPRD) is a generic data source for run
time failure data for electrical components


¢ Nuclear Computerized Library for Assessing Reactor Reliability
(NUCLARR) is a generic data source for on demand failures


¢ Expert Opinion


¢ Miscellaneous references




FUNCTIONAL DATA ANALYSIS (4)


BAYESIAN UPDATING OF FUNCTIONAL FAILURES 


° What? 
— It is a recognized, and standard, practice for functional failures
— Utilizes generic databases 


— Applies a statistical technique to allow Shuttle data to update the 
generic values 


° Why? 
— Provides a tool to utilize sparse data from the Shuttle to generate more 
accurate estimates of failure rates 


— Provides a less conservative way to estimate failure rates for 
components with zero failures 


° Inputs 
— Total hours of operation or number of demands for a component 


— Number of failures experienced (derived from CAR screening and input 
from Engineers) 
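
The slides do not say which prior or likelihood the Shuttle PRA used, so the following is only a sketch of the general idea using a conjugate gamma prior with Poisson data (a common textbook choice for time-based failure rates). The prior parameters and the 20,000 hours of exposure are hypothetical.

```python
def gamma_poisson_update(alpha, beta_hours, failures, exposure_hours):
    """Conjugate update: Gamma(alpha, beta) prior on a failure rate, Poisson data."""
    return alpha + failures, beta_hours + exposure_hours

# Hypothetical generic prior with mean alpha / beta = 1E-5 failures per hour.
prior_alpha = 0.5
prior_beta = prior_alpha / 1e-5  # 50,000 hours

# Hypothetical system-specific data: zero failures in 20,000 operating hours.
post_alpha, post_beta = gamma_poisson_update(
    prior_alpha, prior_beta, failures=0, exposure_hours=20_000.0
)
prior_mean = prior_alpha / prior_beta
post_mean = post_alpha / post_beta
print(f"prior mean     = {prior_mean:.2e} per hour")
print(f"posterior mean = {post_mean:.2e} per hour")
```

With zero failures the posterior mean drops below the generic value but stays non-zero, which matches the motivation given above for components with no recorded failures. For demand-based data the analogous conjugate pair would be a beta prior with binomial data.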
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FUNCTIONAL DATA ANALYSIS (5) neal 
BAYESIAN UPDATING 


° Performed on risk significant components 


— List of risk significant components from iteration 2.2 of the Shuttle PRA 
> Since the list was based on prior model there can be some components that show up as significant in 
iteration 3.0 that have not been screened. These will be screened for iteration 3.1. 
— Components in the top 99% or with RAW greater than 1.1 (RAW measures the change if the 
component failure is set to 1.0 in the model) 


° CARs were screened from first flight until 12/31/2005 


° Only considered KSC and in flight failures 


— Vendor failures were screened out due to inability to capture corresponding
operating/demand data 


° Partial failures were included only if there were no hard failures and were 
assigned either a 0.5 or a 0.1 value depending upon the severity of the failure 
— These values came from NUREG/CR-6268, Volume 3 


— 0.5 was assigned if the component would have been capable of performing some portion of 
the safety function and was only partially degraded. 


— 0.1 was assigned if the component was only slightly degraded. 
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° Failures were discounted based upon corrective action 
— If sufficient information was available the “fix factor” was calculated by taking the failure rate 
before the fix and dividing by the failure rate after the fix 


— If sufficient information was not available the “fix factor” was assumed to be one of the 
following depending upon the type of corrective action 
> 50% for design changes that were described as “improvements” or procedural changes 
> 90% for design changes that “eliminated” the failure mode 
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= PHENOMENOLOGICAL DATA DEVELOPMENT (2) vasa 


SPLAT (SHUTTLE PRA LEAK ANALYSIS TOOL) 


SPLAT calculates the probability of a leak
occurring, then determines the probability
that the leak exceeds the critical leak
size. It is a standard stress-strength
model where leaks are stresses and
the critical leak size is the strength.

[Figure: stress and strength probability density curves; horizontal axis from 500 to 1500, vertical axis from 0.0E+00 to 4.2E-03]

Inputs are entered as distribution
parameters and results are calculated
using Monte Carlo sampling. Distributions listed for the three inputs
(one probability and two sizes): Exponential, Lognormal, Normal, Gamma,
Gumbel, Uniform, and Point Estimate.
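
A stress-strength calculation of this kind can be sketched with plain Monte Carlo sampling. This is not SPLAT itself: the function name, the distribution choices (lognormal leak size, normal critical size), and every parameter value are hypothetical and chosen only to show the mechanics.

```python
import math
import random

def critical_leak_probability(p_leak, n_samples=20_000, seed=1):
    """Estimate P(leak) x P(leak size > critical size) by sampling."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_samples):
        leak_size = rng.lognormvariate(math.log(700.0), 0.3)  # "stress"
        critical_size = rng.normalvariate(1000.0, 100.0)      # "strength"
        if leak_size > critical_size:
            exceed += 1
    p_exceed = exceed / n_samples
    return p_exceed, p_leak * p_exceed

# Hypothetical point estimate for the probability that a leak occurs at all.
p_exceed, p_critical = critical_leak_probability(p_leak=1e-3)
print(f"P(size exceeds critical | leak) = {p_exceed:.3f}")
print(f"P(critical leak)                = {p_critical:.2e}")
```

Fixing the seed makes the run repeatable; more samples tighten the estimate at the cost of run time.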
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HUMAN RELIABILITY ANALYSIS (HRA) DATA
DEVELOPMENT 


° HRA is a method used to describe, qualitatively and quantitatively, the 
occurrence of human failures in the operation of complex machines that affect 
availability and reliability. 


° Modeling human actions with their corresponding failure in a PRA provides a 
more complete picture of the risk and risk contributions. 


° A high quality HRA can provide valuable information on potential areas for 
improvement, including training, procedural and equipment design. 


* Screening analysis is performed on the bulk of the human errors with a 
detailed analysis only performed on the significant contributors 


° There are many different methodologies for modeling human errors in PRA
— For the Shuttle PRA Cognitive Reliability and Error Analysis Method (CREAM) was selected 
as the primary method for detailed analysis 
> It was selected as one of the NASA recommended HRA techniques 
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— The results from CREAM have been favorably benchmarked against other methodologies 
and simulator data as part of the Shuttle PRA 


— The majority of HRA events are processed with a screening analysis that is essentially based 
on the Technique for Human Error Rate Prediction (THERP) in NUREG/CR-1278.
THERP is a recognized HRA technique that has been used for over 20 years, primarily in 
calculating Human Error Probability (HEP) in nuclear power plant PRAs. 

> The screening table was easy to apply and gave conservative values. If an HRA event that was
developed using the screening table became a significant contributor it was then re-modeled using
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HRA S 


[Figure: Comparison of Simulation Data and CREAM Results. CREAM and simulator (SIM) estimates are plotted for “Land Too Hard”, “Failure to Lower Landing Gear”, and “Brake at Wrong” (label truncated) on a logarithmic axis from 1.E-06 to 1.E-02.]


The CREAM results have since been Bayesian updated using
the simulator data 
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® COMMON CAUSE DEFINITION S& 


° In PRA, Common Cause Failures (CCFs) are failures of two or 
more components, subsystems, or structures due to a single 
specific event which bypassed or invalidated redundancy or 
independence at the same time, or in a relatively short interval like 
within a single mission 


- May be the result of a design error, installation error, or maintenance 
error, or due to some adverse common environment 
- Sometimes called a generic failure. 


° Common Cause, as used in PRA, is not a single failure that takes 
out multiple components such as a common power supply to 
computers or common fluid header to multiple pumps. 
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- Single point failures such as these are modeled explicitly in a PRA
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COMMON CAUSE MODELING S& 


(More details and examples on this later) 


° All large PRAs of complex and redundant machines must include 


“common cause” effects to be complete and accurate 


* Common Cause are those conditions that defeat the benefits of 
redundancy 
— Not “single point failures” 
— Similar to “generic cause” 


° There are three recognized ways to perform common cause modeling: 
— The Beta Model 
— The Multiple Greek Letter Model 
— The Alpha Model 


° We use an iterative approach to modeling common cause: first the
Beta Model approach is used, and if the event shows up as a risk driver a
Multiple Greek Letter Model is used


° Generic data from NUREG/CR-5485 is used for the majority of the events since
there are few cases where there is enough Shuttle data to develop
Shuttle specific values
— RCS Thrusters and ECO sensors are examples of cases where Shuttle specific 
data is used to calculate the common cause parameters 
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COMMON CAUSE MODELING (2) S& 


HOW THE BETA MODEL APPROACH WORKS 


° Susceptibility groups (groupings of similar or identical equipment) of 
redundant trains or components are identified 


° A common cause basic event is defined for these groups


° The common cause basic event failure rate is generated by taking the 
independent failure rate times a “Beta” factor. 
— For the beta model it does not matter how many components are in the group 


— The “Beta” factor represents the probability of 2 or more failures given a failure has 
occurred 
> For this reason, the Beta Model may be conservative for component groups larger than 2. 


° The “Beta” factor is taken from NUREG/CR-5485 and has a different 
value for “Operating” failures vs. “Demand” failures 
— Operating failures the “Beta” value is 0.0235 
— Demand failures the “Beta” value is 0.047 
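
A minimal sketch of the beta model as described above, applied to a hypothetical pair of redundant components. The common cause basic event is taken as beta times the independent failure probability, following the slide's wording; the two beta values are the ones quoted above, while the 1E-3 demand failure probability and the function name are illustrative assumptions.

```python
BETA_OPERATING = 0.0235  # value quoted above for operating failures
BETA_DEMAND = 0.047      # value quoted above for demand failures

def redundant_pair_failure(q_independent, beta):
    """Failure of a two-unit redundant group, with a common cause basic event."""
    q_ccf = beta * q_independent            # common cause basic event
    both_independent = q_independent ** 2   # "and" gate over the two units
    # "Or" gate over the independent combination and the common cause event.
    return both_independent + q_ccf - both_independent * q_ccf

q = 1e-3  # hypothetical demand failure probability of one unit
without_ccf = q ** 2
with_ccf = redundant_pair_failure(q, BETA_DEMAND)
print(f"without common cause: {without_ccf:.2e}")
print(f"with common cause:    {with_ccf:.2e}")
```

In this example the common cause term is dozens of times larger than the independent double failure, which illustrates why leaving it out underestimates the risk.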
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® COMMON CAUSE MODELING (3) af 


HOW THE MULTIPLE GREEK MODEL APPROACH WORKS 


¢ Similar to the Beta Model except that the Multiple Greek Model takes credit 
for the full redundancy and therefore can be much more complicated 


— For a 3 component group, there is a “beta” factor and a “gamma” factor where
the “beta” factor is still the probability of 2 or more failures and the “gamma” factor
is the probability of 3 or more failures given 2 or more failures.
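
For a three-component group, one common textbook parameterization of the Multiple Greek Letter model splits a component's total failure probability into independent, double, and triple contributions. The sketch below assumes that parameterization; the gamma value of 0.5 and the total failure probability are hypothetical, and the system expression uses the rare event approximation.

```python
def mgl_three_components(q_total, beta, gamma):
    """Split q_total into independent (q1), double (q2) and triple (q3) events."""
    q1 = (1.0 - beta) * q_total
    q2 = 0.5 * beta * (1.0 - gamma) * q_total  # each specific pair
    q3 = beta * gamma * q_total                # all three together
    return q1, q2, q3

def all_three_fail(q1, q2, q3):
    """Rare event sum over the cut sets that fail all three components."""
    return q1 ** 3 + 3.0 * q1 * q2 + q3

q_total = 1e-3
q1, q2, q3 = mgl_three_components(q_total, beta=0.047, gamma=0.5)
print(f"q1={q1:.3e} q2={q2:.3e} q3={q3:.3e}")
print(f"all three fail: {all_three_fail(q1, q2, q3):.3e}")
```

Setting gamma to 1 makes every common cause event a triple failure, which reproduces the beta model's treatment of the group.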
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CONDITIONAL PROBABILITY Sy 


° Given that an event has already happened what is the probability that 
successive events will fail 


- Example : Given two blown tires in the time interval between main gear touch 
down and nose gear touch down what is the probability that the Orbiter crashes 
(i.e. strut fails or crew loses control of vehicle)


* Conditional probabilities are typically relatively large (e.g. values like 
0.1 to 0.9) and are usually derived from expert opinion or direct 
experience. 
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® CONCLUSIONS S& 


¢ Like redundancy helps but may not help as much as you think because 
there is a point of diminishing returns with like redundancy 


¢ Redundant but diverse designs can defeat common cause and supply the 
best reliability 


¢ Failure to model common cause will lead to underestimation of the risk 


¢ Common cause parameters based on real data are hard to derive due to 
a lack of data 


¢ A high common cause parameter does not mean that a component is 
unreliable, it just means that given that one component has failed, 
additional similar components are more likely to fail 
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Reading a Fault tree vasa 
( A Very Basic Explanation) 


Fault trees are often used to perform Probabilistic Risk
Assessments (PRA). A basic understanding of how to
read a fault tree is needed. The following few slides
describe a few of the most commonly used symbols for
building fault trees and give a very basic example. The
symbols shown in this document are specific to the
SAPHIRE computer program but generally conform to
most fault tree symbols. In some cases the symbols are
demonstrated by using the “Graphic” editor symbols in
SAPHIRE and in some cases the “Logic” editor symbols
are used.
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Fault Trees vasa 


° Examples of Fault trees developed for Shuttle systems:
— Electrical Power
— Auxiliary Power Unit
— Hydraulics
— ECLSS
— Etc.
° Includes hardware, software, and human errors
° Includes common cause failures
°* Fault trees show interdependencies among distributed 
systems by including the interactions with all 
supporting equipment 
— MDMs 
— Coldplates 
— RPCMs /DDCUs 
— Environmental controls 


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“And” gate: 


The “and” gate takes whatever probabilities are input to it and
multiplies them together.


An "AND" gate 
multiplies inputs 
that go into it 
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“Or” gate:


The “or” gate adds its inputs, but in Boolean algebra the adding is a little more
involved. If the probabilities of A or B are put into the “or” gate
the algebraic equation is $A + B - A \times B$. If the probabilities are low
(i.e. less than 0.1) then the answer can be approximated by just A +
B (also known as the “rare event approximation”)


An "OR" gate 
adds inputs to it 
together 
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“N of M” gate:
The “N of M” gate is used to define combinations of M inputs taken N at a time and is


shorthand for modeling this. An example would be to take three
items “A, B and C” two at a time to get the following: AB, AC, and
BC.


The N of M gate 
takes combinations 
of N of M inputs to 


fail 
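
The three gate types described above can be sketched for independent inputs as follows. The function names are illustrative and do not correspond to any SAPHIRE interface; the “N of M” gate is evaluated here as "at least N of the M inputs fail".

```python
import math
from itertools import combinations

def and_gate(probs):
    """All inputs fail: multiply the probabilities."""
    return math.prod(probs)

def or_gate(probs):
    """At least one input fails: 1 - product of the successes."""
    return 1.0 - math.prod(1.0 - p for p in probs)

def n_of_m_gate(n, probs):
    """At least n of the inputs fail (inputs assumed independent)."""
    m = len(probs)
    total = 0.0
    for k in range(n, m + 1):
        for failed in combinations(range(m), k):
            term = 1.0
            for i in range(m):
                term *= probs[i] if i in failed else 1.0 - probs[i]
            total += term
    return total

print(and_gate([0.1, 0.2]))               # A x B
print(or_gate([0.1, 0.2]))                # A + B - A x B
print(list(combinations("ABC", 2)))       # the pairs AB, AC, BC
print(n_of_m_gate(2, [0.1, 0.1, 0.1]))    # 2 of 3
```

For two inputs, `or_gate` reproduces A + B - A x B exactly, and the rare event approximation simply drops the last term.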
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“Basic event”:
The basic event is the most basic (lowest) level that we model to. There is a tendency to model
down to too low a level of detail. However, it is a mistake to model down
to a lower level than that at which data can be acquired to represent the failure
probabilities for that item.


The basic event is
the probability of
failure of an item


BASIC EVENT (3.500E-3)
“Transfer gate”:
The “transfer gate” is used to connect pieces of fault trees together. This lets each piece
fit onto a single piece of paper so it can be more easily printed out and
read. Also, if several fault trees use the same equipment, then
the transfer can be used to model that equipment once and
use it in many different places in other trees.


A transfer
connects other
trees at this point


TRANSFER
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


Undeveloped Event 


The undeveloped event denotes a basic event that is actually a more 
complex event that has not been further developed by fault tree logic. 
SAPHIRE treats this event no differently than a basic event. 


House Event 


The house event denotes a failure that is guaranteed to occur (TRUE) or 
never to occur (FALSE). However, the calculation type assigned to a 
basic event establishes whether or not an event is a house event. 
Consequently, any basic event in SAPHIRE can be a house event, but 
the calculation type dictates the analysis behavior (see Section 5). 


Undeveloped Transfer


The undeveloped transfer indicates that the event is complex enough to
have its own fault tree logic developed elsewhere; however, the event
has been treated as a basic event in the present fault tree.
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Some Fault Tree Basics


“OR” gate with basic events A and B: $A + B - A \times B$
“AND” gate with basic events A and B: $A \times B$


Boolean algebraic identities
(Just a few basic ones are given)

Additive Identities:      Multiplicative Identities*:
$A + 0 = A$               $0A = 0$
$A + 1 = 1$               $1A = A$
$A + A = A$               $AA = A$
$A + \bar{A} = 1$         $A\bar{A} = 0$

$A = A + AB$


*Note: Multiplication (Logical AND) is implied when two variables are written next to each other.
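
The identities above can be checked by brute force over every truth assignment. The sketch below is illustrative only: it reads "+" as logical OR, juxtaposition as logical AND, and takes the last identity in each column to be the complement identities (A OR NOT A = 1, A AND NOT A = 0), which is an assumption about the slide. The names `IDENTITIES` and `holds` are hypothetical.

```python
from itertools import product

# Each entry maps an identity to a check over boolean values a and b.
IDENTITIES = {
    "A + 0 = A": lambda a, b: (a or False) == a,
    "A + 1 = 1": lambda a, b: (a or True) is True,
    "A + A = A": lambda a, b: (a or a) == a,
    "A + (not A) = 1": lambda a, b: (a or not a) is True,
    "0A = 0": lambda a, b: (False and a) is False,
    "1A = A": lambda a, b: (True and a) == a,
    "AA = A": lambda a, b: (a and a) == a,
    "A(not A) = 0": lambda a, b: (a and not a) is False,
    "A = A + AB": lambda a, b: a == (a or (a and b)),
}


def holds(check):
    """True if the identity holds for every combination of a and b."""
    return all(check(a, b) for a, b in product([False, True], repeat=2))


results = {name: holds(check) for name, check in IDENTITIES.items()}
for name, ok in results.items():
    print(f"{name:18s} {'holds' if ok else 'FAILS'}")
```

Note that these identities apply to the Boolean events themselves; the probability formulas for the "or" and "and" gates are applied only after the logic has been reduced.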
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Example:


To demonstrate the use of the symbols to model a system, an
example fault tree is represented in Figures 1 and 2.
Figure 1 is the fault tree for system “Station” and Figure 2 is a
piece of the fault tree that is modeled separately and
connected by a “Transfer” gate (the “transfer” gate name must
be the same as the name of the top of the tree that is being transferred).
In Figures 3 and 4 we find the same set of logic represented
using the “Logic” editor portion of SAPHIRE. The logic editor
graphics give a more compact version of the logic and are
sometimes preferable since they reduce the number of
pages needed to represent the tree.
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Figure 1.


EXAMPLE: Failure of System "Station" ("OR" gate)
   "AND" gate: output of gate "N-M" is multiplied by "D"
      item "D" fails
      "N-M" gate: any combination of 2 of 3 items A, B, or C fail
         item "A" fails
         Item "B" fails
         Item "C" fails
   "OR" gate: either E or F or items from Transfer XYZ fail
      item "E" fails
      item F fails
      Transfer XYZ: XYZ fails
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Figure 2.


XYZ: XYZ fails
   X fails
   Y fails
   Z fails


Figure 3.
EXAMPLE   OR    Failure of System "Station"
   and_1   AND   output of gate "N-M" is multiplied by "D"
      D (1.000E-003)   item "D" fails
      n-m   2/3   Any combination of 2 of 3 items A, B, or C fail
         A (1.000E-003)   item "A" fails
         B (1.000E-003)   Item "B" fails
         C (1.000E-003)   Item "C" fails
   or_1   OR    either E or F or items from Transfer XYZ fail
      E (1.000E-005)   item "E" fails
      F (1.000E-005)   item F fails
      xyz   TRAN   XYZ fails


Figure 4.
xyz   AND   XYZ fails
   X (2.000E-002)   X fails
   Y (2.000E-002)   Y fails
   Z (2.000E-002)   Z fails
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Cut Sets


Cut   %       % Cut   Prob./       Basic
No.   Total   Set     Frequency    Event   Description       Event Prob.
1     35.71   35.71   1.000E-005   E       item "E" fails    1.000E-005
2     71.42   35.71   1.000E-005   F       item F fails      1.000E-005
3     99.99   28.57   8.000E-006   X       X fails           2.000E-002
                                   Y       Y fails           2.000E-002
                                   Z       Z fails           2.000E-002
4     99.99   0.00    1.000E-009   A       item "A" fails    1.000E-003
                                   B       Item "B" fails    1.000E-003
                                   D       item "D" fails    1.000E-003
5     99.99   0.00    1.000E-009   A       item "A" fails    1.000E-003
                                   C       Item "C" fails    1.000E-003
                                   D       item "D" fails    1.000E-003
6     99.99   0.00    1.000E-009   B       Item "B" fails    1.000E-003
                                   D       item "D" fails    1.000E-003
                                   C       Item "C" fails    1.000E-003
Grand total ~ 2.8E-5


In this example we did not consider common cause. More about that later.
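
The cut set table can be reproduced approximately with a short sketch. Each cut set probability is the product of its basic event probabilities, and here the total is taken as the simple sum of the cut sets (the rare event approximation); that choice of total, and the variable names, are assumptions for illustration and may differ from how SAPHIRE quantifies the tree.

```python
from math import prod

# Basic event probabilities as listed for the "Station" example.
event_prob = {
    "A": 1.0e-3, "B": 1.0e-3, "C": 1.0e-3, "D": 1.0e-3,
    "E": 1.0e-5, "F": 1.0e-5,
    "X": 2.0e-2, "Y": 2.0e-2, "Z": 2.0e-2,
}

# Cut sets as listed in the table.
cut_sets = [
    ("E",), ("F",), ("X", "Y", "Z"),
    ("A", "B", "D"), ("A", "C", "D"), ("B", "C", "D"),
]

# A cut set occurs only if all of its basic events fail ("and").
cut_prob = {cs: prod(event_prob[e] for e in cs) for cs in cut_sets}
total = sum(cut_prob.values())  # rare event approximation of the "or"

cumulative = 0.0
for cs, p in cut_prob.items():
    share = 100.0 * p / total
    cumulative += share
    print(f"{'*'.join(cs):6s} {p:.3e}  {share:6.2f}%  cumulative {cumulative:6.2f}%")
print(f"Total ~ {total:.1e}")
```

The ranking makes the point of the table: the single-event cut sets E and F and the triple X, Y, Z dominate the risk, while the three cut sets involving D contribute almost nothing.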
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Simple System Fault Trees and 
Minimal Cutset Problems 
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DRAW A FAULT TREE FOR THE
SYSTEM BELOW. THE TOP EVENT
OF THE FAULT TREE IS “ROOM DARK”


[Figure: diagram of the lighting system, including Light 1]
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A SOLUTION FAULT TREE FOR PROBLEM


[Figure: solution fault tree. Top event: undesired dark event. Inputs include “Power fails” and basic events labeled E and X.]


The Cut Set Solution to the Model


Cut No.   % Total   % Cut Set   Prob./Frequency   Basic Event   Description          Event Prob.
1         82.04     82.04       5.000E-003        E1            Power fails          5.000E-003
2         98.45     16.41       1.000E-003        X3            Fuse blown           1.000E-003
3         100.00    1.64        1.000E-004        X1            Light 1 burned out   1.000E-002
                                                  X2            Light 2 burned out   1.000E-002
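
The percentages in this table appear consistent with a total computed as 1 − Π(1 − P_i) over the cut sets (often called the minimal cut set upper bound) rather than a simple sum. That is an inference from the numbers, not something the slide states, so the sketch below should be read as one plausible reconstruction; the names used are hypothetical.

```python
from math import prod

# Cut set probabilities from the table: E1, X3, and X1 with X2.
cut_set_probs = {
    "E1": 5.0e-3,
    "X3": 1.0e-3,
    "X1*X2": 1.0e-2 * 1.0e-2,
}


def min_cut_upper_bound(probabilities):
    """1 minus the probability that none of the cut sets occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


total = min_cut_upper_bound(cut_set_probs.values())
for name, p in cut_set_probs.items():
    print(f"{name:6s} {p:.3e}  {100.0 * p / total:6.2f}%")
print(f"Total = {total:.4e}")
```

With a simple sum as the total, the first share would come out slightly lower (about 81.97%), which is why the upper bound is the better match for the tabulated 82.04%.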
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EXAMPLE OF MINIMAL CUT SET GENERATION


GENERATION OF THE MINIMAL CUT SETS FROM
A FAULT TREE REQUIRES FOUR STEPS


STEP 1 GENERATE THE INTERMEDIATE EVENT EQUATIONS FOR THE
FAULT TREE
STEP 2 GENERATE AN EQUATION FOR THE TOP EVENT THAT IS A
FUNCTION OF ONLY BASIC EVENTS
STEP 3 REDUCE THE EQUATION GENERATED IN STEP 2 BY THE BOOLEAN
LAWS OF ABSORPTION
- $P \cdot P = P$
- $P + P \cdot Q = P$
STEP 4 WRITE THE EQUATION GENERATED IN STEP 3 IN A SUM-OF-PRODUCTS FORM
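
The four steps can be sketched in a few lines, using the "Station" example from Figures 3 and 4 as input. This is an illustrative top-down expansion, not SAPHIRE's algorithm: the dictionary layout, the gate kind names, and the function names are all assumptions. Representing each product as a set of basic events applies P · P = P automatically, and dropping any product that contains a smaller one applies P + P · Q = P.

```python
from itertools import combinations, product

# Gate name -> (kind, inputs[, n]); names not in the dict are basic events.
TREE = {
    "EXAMPLE": ("OR", ["and_1", "or_1"]),
    "and_1": ("AND", ["D", "n-m"]),
    "n-m": ("N_OF_M", ["A", "B", "C"], 2),
    "or_1": ("OR", ["E", "F", "xyz"]),
    "xyz": ("AND", ["X", "Y", "Z"]),  # the transferred tree
}


def cut_sets(node, tree):
    """Steps 1 and 2: expand a node into products of basic events only."""
    if node not in tree:
        return {frozenset([node])}
    kind, inputs, *rest = tree[node]
    child_sets = [cut_sets(child, tree) for child in inputs]
    if kind == "OR":
        return set().union(*child_sets)
    if kind == "AND":
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    if kind == "N_OF_M":
        result = set()
        for chosen in combinations(child_sets, rest[0]):
            result |= {frozenset().union(*combo) for combo in product(*chosen)}
        return result
    raise ValueError(f"unknown gate kind: {kind}")


def minimize(sets):
    """Step 3: absorption, drop any product that contains a smaller one."""
    return {s for s in sets if not any(other < s for other in sets)}


minimal = minimize(cut_sets("EXAMPLE", TREE))
# Step 4: write the result as a sum of products.
terms = sorted("".join(sorted(s)) for s in minimal)
print(" + ".join(terms))  # ABD + ACD + BCD + E + F + XYZ
```

The six products match the six cut sets in the "Station" cut set table. Full expansion like this can grow very quickly on large trees, so it is best treated as a way to understand the steps rather than a practical solver.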


[Figure: example showing CUT SETS and MINIMAL CUT SETS]
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